(54) Method for serial analysis of gene expression



(57) Serial analysis of gene expression, SAGE, a method for the rapid quantitative and qualitative
analysis of transcripts, is provided. Short defined sequence tags corresponding to expressed genes are
isolated and analyzed. Sequencing of over 1,000 defined tags in a short period of time (e.g., hours)
reveals a gene expression pattern characteristic of the function of a cell or tissue. Moreover, SAGE is
useful as a gene discovery tool for the identification and isolation of novel sequence tags corresponding
to novel transcripts and genes.
Description

This invention was made with support from National Institutes of Health Grant Nos. CA57345, CA35494 and
GM07309. The Government has certain rights in this invention.

This application is a continuation-in-part application of Serial No. 08/527,154, filed September 12, 1995.

Field of the Invention 

The present invention relates generally to the field of gene expression and specifically to a method for the serial
analysis of gene expression (SAGE) for the analysis of a large number of transcripts by identification of a defined
region of a transcript which corresponds to a region of an expressed gene.

Background of the Invention 

Determination of the genomic sequence of higher organisms, including humans, is now a real and attainable goal.
However, this analysis only represents one level of genetic complexity. The ordered and timely expression of genes
represents another level of complexity equally important to the definition and biology of the organism.

The role of sequencing complementary DNA (cDNA), reverse transcribed from mRNA, as part of the human genome
project has been debated as proponents of genomic sequencing have argued the difficulty of finding every mRNA
expressed in all tissues, cell types and developmental stages and have pointed out that much valuable information
from intronic and intergenic regions, including control and regulatory sequences, will be missed by cDNA sequencing
(Report of the Committee on Mapping and Sequencing the Human Genome, National Academy Press, Washington,
D.C., 1988). Sequencing of transcribed regions of the genome using cDNA libraries has heretofore been considered
unsatisfactory. Libraries of cDNA are believed to be dominated by repetitive elements, mitochondrial genes, ribosomal
RNA genes, and other nuclear genes comprising common or housekeeping sequences. It is believed that cDNA libraries
do not provide all sequences corresponding to structural and regulatory polypeptides or peptides (Putney, et al., Nature,
302:718, 1983).

Another drawback of standard cDNA cloning is that some mRNAs are abundant while others are rare. The cellular
quantities of mRNA from various genes can vary by several orders of magnitude.

Techniques based on cDNA subtraction or differential display can be quite useful for comparing gene expression
differences between two cell types (Hedrick, et al., Nature, 308:149, 1984; Liang and Pardee, Science, 257:967,
1992), but provide only a partial analysis, with no direct information regarding abundance of messenger RNA. The
expressed sequence tag (EST) approach has been shown to be a valuable tool for gene discovery (Adams, et al.,
Science, 252:1656, 1991; Adams, et al., Nature, 355:632, 1992; Okubo, et al., Nature Genetics, 2:173, 1992), but like
Northern blotting, RNase protection, and reverse transcriptase-polymerase chain reaction (RT-PCR) analysis (Alwine,
et al., Proc. Natl. Acad. Sci. U.S.A., 74:5350, 1977; Zinn, et al., Cell, 34:865, 1983; Veres, et al., Science, 237:415,
1987), only evaluates a limited number of genes at a time. In addition, the EST approach preferably employs nucleotide
sequences of 150 base pairs or longer for similarity searches and mapping.

Sequence tagged sites (STSs) (Olson, et al., Science, 245:1434, 1989) have also been utilized to identify genomic
markers for the physical mapping of the genome. These short sequences from physically mapped clones represent
uniquely identified map positions in the genome. In contrast, the identification of expressed genes relies on expressed
sequence tags which are markers for those genes actually transcribed and expressed in vivo.

There is a need for an improved method which allows rapid, detailed analysis of thousands of expressed genes
for the investigation of a variety of biological applications, particularly for establishing the overall pattern of gene
expression in different cell types or in the same cell type under different physiologic or pathologic conditions.
Identification of different patterns of expression has several utilities, including the identification of appropriate
therapeutic targets, candidate genes for gene therapy (e.g., gene replacement), tissue typing, forensic identification,
mapping locations of disease-associated genes, and for the identification of diagnostic and prognostic indicator genes.

SUMMARY OF THE INVENTION

The present invention provides a method for the rapid analysis of numerous transcripts in order to identify the
overall pattern of gene expression in different cell types or in the same cell type under different physiologic,
developmental or disease conditions. The method is based on the identification of a short nucleotide sequence tag at a
defined position in a messenger RNA. The tag is used to identify the corresponding transcript and gene from which it
was transcribed. By utilizing dimerized tags, termed a "ditag", the method of the invention allows elimination of certain
types of bias which might occur during cloning and/or amplification and possibly during data evaluation. Concatenation
of these short nucleotide sequence tags allows the efficient analysis of transcripts in a serial manner by sequencing
multiple tags on a single DNA molecule, for example, a DNA molecule inserted in a vector or in a single clone.

The method described herein is the serial analysis of gene expression (SAGE), a novel approach which allows
the analysis of a large number of transcripts. To demonstrate this strategy, short cDNA sequence tags were generated
from mRNA isolated from pancreas, randomly paired to form ditags, concatenated, and cloned. Manual sequencing
of 1,000 tags revealed a gene expression pattern characteristic of pancreatic function. Identification of such patterns
is important diagnostically and therapeutically, for example. Moreover, the use of SAGE as a gene discovery tool was
documented by the identification and isolation of new pancreatic transcripts corresponding to novel tags. SAGE
provides a broadly applicable means for the quantitative cataloging and comparison of expressed genes in a variety of
normal, developmental, and disease states.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGURE 1 shows a schematic of SAGE. The first restriction enzyme, or anchoring enzyme, is NlaIII and the second
enzyme, or tagging enzyme, is FokI in this example. Sequences represent primer derived sequences, and transcript
derived sequences with "X" and "O" representing nucleotides of different tags.

FIGURE 2 shows a comparison of transcript abundance. Bars represent the percent abundance as determined
by SAGE (dark bars) or hybridization analysis (light bars). SAGE quantitations were derived from Table 1 as follows:
TRY1/2 includes the tags for trypsinogen 1 and 2, PROCAR indicates tags for procarboxypeptidase A1, CHYMO
indicates tags for chymotrypsinogen, and ELA/PRO includes the tags for elastase IIIB and protease E. Error bars
represent the standard deviation determined by taking the square root of counted events and converting it to a percent
abundance (assumed Poisson distribution).

FIGURE 3 shows the results of screening a cDNA library with SAGE tags. P1 and P2 show typical hybridization
results obtained with 13 bp oligonucleotides as described in the Examples. P1 and P2 correspond to the transcripts
described in Table 2. Images were obtained using a Molecular Dynamics PhosphorImager and the circle indicates the
outline of the filter membrane to which the recombinant phage were transferred prior to hybridization.

FIGURE 4 is a block diagram of a tag code database access system in accordance with the present invention.
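
By way of illustration only, the error bars described for FIGURE 2 (square root of the counted events, converted to a percent abundance under an assumed Poisson distribution) may be sketched in Python as follows. The function name and the example counts are hypothetical and are not taken from Table 1.

```python
import math


def percent_abundance_with_error(tag_count: int, total_tags: int) -> tuple[float, float]:
    """Return (percent abundance, one-standard-deviation error bar in percent)."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    if not 0 <= tag_count <= total_tags:
        raise ValueError("tag_count must lie between 0 and total_tags")
    percent = 100.0 * tag_count / total_tags
    # Poisson assumption: the standard deviation of a count is its square root.
    error = 100.0 * math.sqrt(tag_count) / total_tags
    return percent, error


# Hypothetical example: a tag observed 100 times among 1,000 sequenced tags.
abundance, error_bar = percent_abundance_with_error(100, 1000)
```

Under this assumption the relative error shrinks as the count grows, so rare tags carry proportionally wider error bars than abundant ones.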



DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present invention provides a rapid, quantitative process for determining the abundance and nature of
transcripts corresponding to expressed genes. The method, termed serial analysis of gene expression (SAGE), is based
on the identification of and characterization of partial, defined sequences of transcripts corresponding to gene
segments. These defined transcript sequence "tags" are markers for genes which are expressed in a cell, a tissue, or an
extract, for example.

SAGE is based on several principles. First, a short nucleotide sequence tag (9 to 10 bp) contains sufficient
information content to uniquely identify a transcript provided it is isolated from a defined position within the transcript.
For example, a sequence as short as 9 bp can distinguish 262,144 transcripts (4^9) given a random nucleotide distribution
at the tag site, whereas estimates suggest that the human genome encodes about 50,000 to 200,000 transcripts (Fields,
et al., Nature Genetics, 7:345, 1994). The size of the tag can be shorter for lower eukaryotes or prokaryotes, for example,
where the number of transcripts encoded by the genome is lower. For example, a tag as short as 6-7 bp may be
sufficient for distinguishing transcripts in yeast.
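
By way of illustration only, the information content of a tag may be checked with a short Python sketch. It assumes, as stated above, a random nucleotide distribution at the tag site; the function names are hypothetical.

```python
def distinct_tags(tag_length: int, alphabet_size: int = 4) -> int:
    """Number of different tags of the given length over A, C, G and T."""
    return alphabet_size ** tag_length


def min_tag_length(n_transcripts: int) -> int:
    """Shortest tag length whose tag space is at least n_transcripts."""
    length = 1
    while distinct_tags(length) < n_transcripts:
        length += 1
    return length


nine_bp_capacity = distinct_tags(9)        # 4**9
length_for_200k = min_tag_length(200_000)  # upper estimate quoted above
```

Real transcripts do not have uniformly random sequence at the tag site, so the tag space is a ceiling on the number of distinguishable transcripts rather than a guarantee.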

Second, random dimerization of tags allows a procedure for reducing bias (caused by amplification and/or cloning).
Third, concatenation of these short sequence tags allows the efficient analysis of transcripts in a serial manner by
sequencing multiple tags within a single vector or clone. As with serial communication by computers, wherein
information is transmitted as a continuous string of data, serial analysis of the sequence tags requires a means to
establish the register and boundaries of each tag. All of these principles may be applied independently, in combination,
or in combination with other known methods of sequence identification.

In a first embodiment, the invention provides a method for the detection of gene expression in a particular cell or
tissue, or cell extract, for example, including at a particular developmental stage or in a particular disease state. The
method comprises producing complementary deoxyribonucleic acid (cDNA) oligonucleotides; isolating a first defined
nucleotide sequence tag from a first cDNA oligonucleotide and a second defined nucleotide sequence tag from a
second cDNA oligonucleotide; linking the first tag to a first oligonucleotide linker, wherein the first oligonucleotide linker
comprises a first sequence for hybridization of an amplification primer, and linking the second tag to a second
oligonucleotide linker, wherein the second oligonucleotide linker comprises a second sequence for hybridization of an
amplification primer; and determining the nucleotide sequence of the tag(s), wherein the tag(s) correspond to an
expressed gene.

Figure 1 shows a schematic representation of the analysis of messenger RNA (mRNA) using SAGE as described
in the method of the invention. mRNA is isolated from a cell or tissue of interest for in vitro synthesis of a
double-stranded DNA sequence by reverse transcription of the mRNA. The double-stranded DNA complement of mRNA
formed is referred to as complementary DNA (cDNA).

The term "oligonucleotide" as used herein refers to primers or oligomer fragments comprised of two or more
deoxyribonucleotides or ribonucleotides, preferably more than three. The exact size will depend on many factors, which
in turn depend on the ultimate function or use of the oligonucleotide.

The method further includes ligating the first tag linked to the first oligonucleotide linker to the second tag linked
to the second oligonucleotide linker and forming a "ditag". Each ditag represents two defined nucleotide sequences of
at least one transcript, representative of at least one gene. Typically, a ditag represents two transcripts from two distinct
genes. The presence of a defined cDNA tag within the ditag is indicative of expression of a gene having a sequence
of that tag.

The analysis of ditags, formed prior to any amplification step, provides a means to eliminate potential distortions
introduced by amplification, e.g., PCR. The pairing of tags for the formation of ditags is a random event. The number
of different tags is expected to be large; therefore, the probability of any two tags being coupled in the same ditag is
small, even for abundant transcripts. Therefore, repeated ditags potentially produced by biased standard amplification
and/or cloning methods are excluded from analysis by the method of the invention.
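
By way of illustration only, the exclusion of repeated ditags before counting tags may be sketched in Python as follows. The sketch assumes each ditag is supplied as a plain string with the flanking anchoring enzyme sites already removed, that the two tags are joined tail to tail (so the second tag is read as the reverse complement of the ditag's 3' end), and that a ditag and its reverse complement represent the same molecule. The function names and the 9 bp default are hypothetical choices.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tally_tags(ditags, tag_length: int = 9) -> Counter:
    """Count tags, using each distinct ditag only once."""
    seen = set()
    counts = Counter()
    for ditag in ditags:
        ditag = ditag.upper()
        if len(ditag) < 2 * tag_length:
            continue  # too short to hold two full tags
        # Canonical key so a ditag read from either strand is recognised as a repeat.
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue  # repeated ditag: likely an amplification or cloning artifact
        seen.add(key)
        counts[ditag[:tag_length]] += 1
        counts[reverse_complement(ditag[-tag_length:])] += 1
    return counts


# Hypothetical ditags: the first one appears twice and once more as its reverse complement.
d1 = "ACGTACGTA" + "GGGGGGGGG"
d2 = "ACGTACGTA" + "TTTTTTTTT"
tag_counts = tally_tags([d1, d1, reverse_complement(d1), d2])
```

In this example the abundant tag still contributes two counts, because it occurs in two different ditags, while the repeated copies of the first ditag contribute nothing further.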

The term "defined" nucleotide sequence, or "defined" nucleotide sequence tag, refers to a nucleotide sequence
derived from either the 5' or 3' terminus of a transcript. The sequence is defined by cleavage with a first restriction
endonuclease, and represents nucleotides either 5' or 3' of the first restriction endonuclease site, depending on which
terminus is used for capture (e.g., 3' when oligo-dT is used for capture as described herein).

As used herein, the terms "restriction endonucleases" and "restriction enzymes" refer to bacterial enzymes which
bind to a specific double-stranded DNA sequence termed a recognition site or recognition nucleotide sequence, and
cut double-stranded DNA at or near the specific recognition site.

The first endonuclease, termed "anchoring enzyme" or "AE" in Figure 1, is selected by its ability to cleave a
transcript at least one time and therefore produce a defined sequence tag from either the 5' or 3' end of a transcript.
Preferably, a restriction endonuclease having at least one recognition site and therefore having the ability to cleave a
majority of cDNAs is utilized. For example, as illustrated herein, enzymes which have a 4 base pair recognition site
are expected to cleave every 256 base pairs (4^4) on average, while most transcripts are considerably larger. Restriction
endonucleases which recognize a 4 base pair site include NlaIII, as exemplified in the EXAMPLES of the present
invention. Other similar endonucleases having at least one recognition site within a DNA molecule (e.g., cDNA) will
be known to those of skill in the art (see for example, Current Protocols in Molecular Biology, Vol. 2, 1995, Ed. Ausubel,
et al., Greene Publish. Assoc. & Wiley Interscience, Unit 3.1.15; New England Biolabs Catalog, 1995).
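
By way of illustration only, the expected cutting frequency of an anchoring enzyme may be estimated with the Python sketch below. It assumes random sequence with equal base frequencies and treats each position as independent, which is an approximation; the function names and the 1,500 bp transcript length are hypothetical.

```python
def expected_site_spacing(site_length: int) -> int:
    """Average distance in bp between occurrences of a site in random sequence."""
    return 4 ** site_length


def prob_at_least_one_site(transcript_length: int, site_length: int = 4) -> float:
    """Approximate chance that a transcript contains the site at least once."""
    p_site = 1.0 / 4 ** site_length
    positions = max(transcript_length - site_length + 1, 0)
    return 1.0 - (1.0 - p_site) ** positions


four_bp_spacing = expected_site_spacing(4)
p_four_bp_cutter = prob_at_least_one_site(1500, site_length=4)
p_eight_bp_cutter = prob_at_least_one_site(1500, site_length=8)
```

Under these assumptions an enzyme with a 4 base pair site cuts nearly every transcript of this length, whereas an enzyme with an 8 base pair site cuts only a small fraction.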

After cleavage with the anchoring enzyme, the most 5' or 3' region of the cleaved cDNA can then be isolated by
binding to a capture medium. For example, as illustrated in the present EXAMPLES, streptavidin beads are used to
isolate the defined 3' nucleotide sequence tag when the oligo dT primer for cDNA synthesis is biotinylated. In this
example, cleavage with the first or anchoring enzyme provides a unique site on each transcript which corresponds to
the restriction site located closest to the poly-A tail. Likewise, the 5' cap of a transcript (the cDNA) can be utilized for
labeling or binding a capture means for isolation of a 5' defined nucleotide sequence tag. Those of skill in the art will
know other similar capture systems (e.g., biotin/streptavidin, digoxigenin/anti-digoxigenin) for isolation of the defined
sequence tag as described herein.

The invention is not limited to use of a single "anchoring" or first restriction endonuclease. It may be desirable to
perform the method of the invention sequentially, using different enzymes on separate samples of a preparation, in
order to identify a complete pattern of transcription for a cell or tissue. In addition, the use of more than one anchoring
enzyme provides confirmation of the expression pattern obtained from the first anchoring enzyme. Therefore, it is also
envisioned that the first or anchoring endonuclease may rarely cut cDNA such that few or no cDNA representing
abundant transcripts are cleaved. Thus, transcripts which are cleaved represent "unique" transcripts. Restriction
enzymes that have a 7-8 bp recognition site, for example, would be enzymes that would rarely cut cDNA. Similarly, more
than one tagging enzyme, described below, can be utilized in order to identify a complete pattern of transcription.

The term "isolated" as used herein includes polynucleotides substantially free of other nucleic acids, proteins,
lipids, carbohydrates or other materials with which it is naturally associated. cDNA is not naturally occurring as such,
but rather is obtained via manipulation of a partially purified naturally occurring mRNA. Isolation of a defined sequence
tag refers to the purification of the 5' or 3' tag from other cleaved cDNA.

In one embodiment, the isolated defined nucleotide sequence tags are separated into two pools of cDNA, when
the linkers have different sequences. Each pool is ligated via the anchoring, or first restriction endonuclease site to
one of two linkers. When the linkers have the same sequence, it is not necessary to separate the tags into pools. The
first oligonucleotide linker comprises a first sequence for hybridization of an amplification primer and the second
oligonucleotide linker comprises a second sequence for hybridization of an amplification primer. In addition, the linkers
further comprise a second restriction endonuclease site, also termed the "tagging enzyme" or "TE". The method of the
invention does not require, but preferably comprises, amplifying the ditag oligonucleotide after ligation.

The second restriction endonuclease cleaves at a site distant from or outside of the recognition site. For example,
the second restriction endonuclease can be a type IIS restriction enzyme. Type IIS restriction endonucleases cleave
at a defined distance up to 20 bp away from their asymmetric recognition sites (Szybalski, W., Gene, 40:169, 1985).
Examples of type IIS restriction endonucleases include BsmFI and FokI. Other similar enzymes will be known to those
of skill in the art (see, Current Protocols in Molecular Biology, supra).

The first and second "linkers" which are ligated to the defined nucleotide sequence tags are oligonucleotides having
the same or different nucleotide sequences. For example, the linkers illustrated in the Examples of the present invention
include linkers having different sequences:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
(SEQ ID NO:1)

3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'
(SEQ ID NO:2)

and

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
(SEQ ID NO:3)

3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'
(SEQ ID NO:4),

wherein A is a dideoxy nucleotide (e.g., dideoxy A). Other similar linkers can be utilized in the method of the invention.
Those of skill in the art can design such alternate linkers.

The linkers are designed so that cleavage of the ligation products with the second restriction enzyme, or tagging
enzyme, results in release of the linker having a defined nucleotide sequence tag (e.g., 3' of the restriction endonuclease
cleavage site as exemplified herein). The defined nucleotide sequence tag may be from about 6 to 30 base pairs.
Preferably, the tag is about 9 to 11 base pairs. Therefore, a ditag is from about 12 to 60 base pairs, and preferably
from 18 to 22 base pairs.

The pool of defined tags ligated to linkers having the same sequence, or the two pools of defined nucleotide
sequence tags ligated to linkers having different nucleotide sequences, are randomly ligated to each other "tail to tail".
The portion of the cDNA tag furthest from the linker is referred to as the "tail". As illustrated in FIGURE 1, the ligated
tag pair, or ditag, has a first restriction endonuclease site upstream (5') and a first restriction endonuclease site
downstream (3') of the ditag; a second restriction endonuclease cleavage site upstream and downstream of the ditag; and
a linker oligonucleotide containing both a second restriction enzyme recognition site and an amplification primer
hybridization site upstream and downstream of the ditag. In other words, the ditag is flanked by the first restriction
endonuclease site, the second restriction endonuclease cleavage site and the linkers, respectively.

The ditag can be amplified by utilizing primers which specifically hybridize to one strand of each linker. Preferably,
the amplification is performed by standard polymerase chain reaction (PCR) methods as described (U.S. Patent No.
4,683,195). Alternatively, the ditags can be amplified by cloning in prokaryotic-compatible vectors or by other
amplification methods known to those of skill in the art.

The term "primer" as used herein refers to an oligonucleotide, whether occurring naturally or produced synthetically,
which is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of primer
extension product which is complementary to a nucleic acid strand is induced, i.e., in the presence of nucleotides and
an agent for polymerization such as DNA polymerase and at a suitable temperature and pH. The primer is preferably
single stranded for maximum efficiency in amplification. Preferably, the primer is an oligodeoxyribonucleotide. The
primer must be sufficiently long to prime the synthesis of extension products in the presence of the agent for
polymerization. The exact lengths of the primers will depend on many factors, including temperature and source of primer.

The primers herein are selected to be "substantially" complementary to the different strands of each specific
sequence to be amplified. This means that the primers must be sufficiently complementary to hybridize with their
respective strands. Therefore, the primer sequence need not reflect the exact sequence of the template. In the present
invention, the primers are substantially complementary to the oligonucleotide linkers.

Primers useful for amplification of the linkers exemplified herein as SEQ ID NO:1-4 include
5'-CCAGCTTATTCAATTCGGTCC-3' (SEQ ID NO:5) and 5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:6). Those of skill in the art can
prepare similar primers for amplification based on the nucleotide sequence of the linkers without undue
experimentation.

Cleavage of the amplified PCR product with the first restriction endonuclease allows isolation of ditags which can
be concatenated by ligation. After ligation, it may be desirable to clone the concatemers, although it is not required in
the method of the invention. Analysis of the ditags or concatemers, whether or not amplification was performed, is by
standard sequencing methods. Concatemers generally consist of about 2 to 200 ditags and preferably from about 8
to 20 ditags. While these are preferred concatemers, it will be apparent that the number of ditags which can be
concatenated will depend on the length of the individual tags and can be readily determined by those of skill in the art
without undue experimentation. After formation of concatemers, multiple tags can be cloned into a vector for sequence
analysis, or alternatively, ditags or concatemers can be directly sequenced without cloning by methods known to those
of skill in the art.
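
By way of illustration only, the recovery of ditags from a sequenced concatemer may be sketched in Python as follows. The sketch assumes the anchoring enzyme is NlaIII, whose recognition site is CATG, so that this site punctuates the concatemer and establishes the boundaries of each ditag. The function name is hypothetical, and the default length window of 18 to 22 bp corresponds to two tags of 9 to 11 bp each.

```python
def split_concatemer(sequence: str, site: str = "CATG",
                     min_len: int = 18, max_len: int = 22) -> list[str]:
    """Split a concatemer read at each anchoring enzyme site and keep ditag-sized pieces."""
    fragments = sequence.upper().split(site)
    # Discard empty pieces and fragments outside the expected ditag length range.
    return [frag for frag in fragments if min_len <= len(frag) <= max_len]


# Hypothetical read: two ditags and one spurious short fragment between CATG sites.
ditag_a = "ACGTACGTAGGGGGGGGG"    # 18 bp
ditag_b = "TTTTTTTTTTAAAAAAAAAA"  # 20 bp
read = "CATG" + ditag_a + "CATG" + "ACGT" + "CATG" + ditag_b + "CATG"
recovered = split_concatemer(read)
```

A ditag whose junction happens to create an internal CATG would be split and discarded by this simple filter; a fuller treatment would need to handle that case.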

Among the standard procedures for cloning the defined nucleotide sequence tags of the invention is insertion of
the tags into vectors such as plasmids or phage. The ditag or concatemers of ditags produced by the method described
herein are cloned into recombinant vectors for further analysis, e.g., sequence analysis, plaque/plasmid hybridization
using the tags as probes, by methods known to those of skill in the art.

The term "recombinant vector" refers to a plasmid, virus or other vehicle known in the art that has been manipulated
by insertion or incorporation of the ditag genetic sequences. Such vectors contain a promoter sequence which facilitates
the efficient transcription of a marker genetic sequence, for example. The vector typically contains an origin of
replication, a promoter, as well as specific genes which allow phenotypic selection of the transformed cells. Vectors
suitable for use in the present invention include, for example, pBlueScript (Stratagene, La Jolla, CA); pBC, pSL301
(Invitrogen); and other similar vectors known to those of skill in the art. Preferably, the ditags or concatemers thereof
are ligated into a vector for sequencing purposes.

Vectors in which the ditags are cloned can be transferred into a suitable host cell. "Host cells" are cells in which a
vector can be propagated and its DNA expressed. The term also includes any progeny of the subject host cell. It is
understood that all progeny may not be identical to the parental cell since there may be mutations that occur during
replication. However, such progeny are included when the term "host cell" is used. Methods of stable transfer, meaning
that the foreign DNA is continuously maintained in the host, are known in the art.

Transformation of a host cell with a vector containing ditag(s) may be carried out by conventional techniques as
are well known to those skilled in the art. Where the host is prokaryotic, such as E. coli, competent cells which are
capable of DNA uptake can be prepared from cells harvested after exponential growth phase and subsequently treated
by the CaCl2 method using procedures well known in the art. Alternatively, MgCl2 or RbCl can be used. Transformation
can also be performed by electroporation or other commonly used methods in the art.

The ditags present in a particular clone can be sequenced by standard methods (see for example, Current Protocols
in Molecular Biology, supra, Unit 7) either manually or using automated methods.

In another embodiment, the present invention provides a kit useful for detection of gene expression wherein the
presence of a defined nucleotide tag or ditag is indicative of expression of a gene having a sequence of the tag, the
kit comprising one or more containers comprising a first container containing a first oligonucleotide linker having a first
sequence useful for hybridization of an amplification primer; a second container containing a second oligonucleotide
linker having a second sequence useful for hybridization of an amplification primer, wherein the linkers further comprise
a restriction endonuclease site for cleavage of DNA at a site distant from the restriction endonuclease recognition site;
and a third and fourth container having nucleic acid primers for hybridization to the first and second unique sequence
of the linker. It is apparent that if the oligonucleotide linkers comprise the same nucleotide sequence, only one
container containing linkers is necessary in the kit of the invention.

In yet another embodiment, the invention provides an oligonucleotide composition having at least two defined
nucleotide sequence tags, wherein at least one of the sequence tags corresponds to at least one expressed gene.
The composition consists of about 1 to 200 ditags, and preferably about 8 to 20 ditags. Such compositions are useful
for the analysis of gene expression by identifying the defined nucleotide sequence tag corresponding to an expressed
gene in a cell, tissue or cell extract, for example.

It is envisioned that the identification of differentially expressed genes using the SAGE technique of the invention
can be used in combination with other genomics techniques. For example, individual tags, and preferably ditags, can
[...] hybridized are preferably unlabeled and the ditag is preferably detectably labeled. Alternatively, the oligonucleotide
can be labeled rather than the ditag. The ditags can be detectably labeled, for example, with a radioisotope, a
fluorescent compound, a bioluminescent compound, a chemiluminescent compound, a metal chelator, or an enzyme. Those
of ordinary skill in the art will know of other suitable labels for binding to the ditag, or will be able to ascertain such,
using routine experimentation. For example, PCR can be performed with labeled (e.g., fluorescein tagged) primers.
Preferably, the ditag contains a fluorescent end label.

The labeled or unlabeled ditags are separated into single-stranded molecules which are preferably serially diluted
and added to a solid support (e.g., a silicon chip as described by Fodor, et al., Science, 251:767, 1991) containing
oligonucleotides representing, for example, every possible permutation of a 10-mer (e.g., in each grid of a chip). The
solid support is then used to determine differential expression of the tags contained within that support (e.g., on a grid
on a chip) by hybridization of the oligonucleotides on the solid support with tags produced from cells under different
conditions (e.g., different stage of development, growth of cells in the absence and presence of a growth factor, normal
versus transformed cells, comparison of different tissue expression, etc.). In the case of fluoresceinated end labeled
ditags, analysis of fluorescence is indicative of hybridization to a particular 10-mer. When the immobilized
oligonucleotide is fluoresceinated, for example, a loss of fluorescence due to quenching (by the proximity of the hybridized
ditag to the labeled oligo) is observed and is analyzed for the pattern of gene expression. An illustrative example of the
method is shown in Example 4 herein.

The SAGE method of the invention is also useful for clonal sequencing, similar to limiting dilution techniques used
in cloning of cell lines. For example, ditags or concatemers thereof are diluted and added to individual receptacles
such that each receptacle contains less than one DNA molecule per receptacle. DNA in each receptacle is amplified
and sequenced by standard methods known in the art, including mass spectroscopy. Assessment of differential
expression is performed as described above for SAGE.

Those of skill in the art can readily determine other methods of analysis for ditags or individual tags produced by
SAGE as described in the present invention, without resorting to undue experimentation.

The concept of deriving a defined tag from a sequence in accordance with the present invention is useful in matching
tags of samples to a sequence database. In the preferred embodiment, a computer method is used to match a sample
sequence with known sequences.

In one embodiment, a sequence tag for a sample is compared to corresponding information in a sequence database
to identify known sequences that match the sample sequence. One or more tags can be determined for each sequence
in the sequence database as the N base pairs adjacent to each anchoring enzyme site within the sequence. However,
in the preferred embodiment, only the first anchoring enzyme site from the 3' end is used to determine a tag. In the
preferred embodiment, the adjacent base pairs defining a tag are on the 3' side of the anchoring enzyme site, and N
is preferably 9.
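
By way of illustration only, the derivation of a tag from a database sequence may be sketched in Python as follows. The sketch assumes the sequence is given 5' to 3' as the sense strand and that the anchoring enzyme is NlaIII (recognition site CATG); the function name and the example sequence are hypothetical.

```python
from typing import Optional


def extract_tag(sequence: str, site: str = "CATG", n: int = 9) -> Optional[str]:
    """Return the n bases on the 3' side of the 3'-most anchoring enzyme site, or None."""
    seq = sequence.upper()
    pos = seq.rfind(site)  # first site encountered when scanning from the 3' end
    if pos == -1:
        return None  # no anchoring enzyme site in this sequence
    start = pos + len(site)
    tag = seq[start:start + n]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == n else None


# Hypothetical sequence with two CATG sites; only the 3'-most one defines the tag.
example_tag = extract_tag("GGCATGAAACATGTTTCCCGGGAAAT")
```

Returning None for sequences without a usable site keeps such records out of the tag index rather than assigning them a truncated tag.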

A linear search through such a database may be used. However, in the preferred embodiment, a sequence tag
from a sample is converted to a unique numeric representation by converting each base pair (A, C, G, or T) of an
N-base tag to a number or "tag code" (e.g., A=0, C=1, G=2, T=3, or any other suitable mapping). A tag is determined
for each sequence of a sequence database as described above, and the tag is converted to a tag code in a similar
manner. In the preferred embodiment, a set of tag codes for a sequence database is stored in a pointer file. The tag
code for a sample sequence is compared to the tag codes in the pointer file to determine the location in the sequence
database of the sequence corresponding to the sample tag code. (Multiple corresponding sequences may exist if the
sequence database has redundancies.)
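
As a hedged sketch of the tag-code idea, the following Python reads a tag as a base-4 number using the mapping quoted in the text (A=0, C=1, G=2, T=3). The names `tag_to_code` and `build_pointer_file`, the dictionary used as a stand-in for the pointer file, and the three-record database are assumptions made for illustration only.

```python
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}   # mapping given in the text


def tag_to_code(tag):
    """Interpret an N-base tag as a base-4 number (first base most significant)."""
    code = 0
    for base in tag.upper():
        code = code * 4 + BASE_VALUE[base]
    return code


def build_pointer_file(database_tags):
    """Map each tag code to the record numbers that carry it.

    `database_tags` holds one tag per database record; several records may
    share a code when the database has redundancies.
    """
    pointers = {}
    for record_number, tag in enumerate(database_tags):
        pointers.setdefault(tag_to_code(tag), []).append(record_number)
    return pointers


# Hypothetical three-record database; records 0 and 2 are redundant.
pointer_file = build_pointer_file(["GAGCACACC", "TCCTCAAAA", "GAGCACACC"])
sample_code = tag_to_code("GAGCACACC")
print(pointer_file.get(sample_code, []))   # record numbers matching the sample
```

Keeping a list of record numbers per code accommodates the redundant entries mentioned in the parenthetical remark.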

FIGURE 4 is a block diagram of a tag code database access system in accordance with the present invention. A
sequence database 10 (e.g., the Human Genome Sequence Database) is processed as described above, such that
each sequence has a tag code determined and stored in a pointer file 12. A sample tag code X for a sample is determined
as described above and stored within a memory location 14 of a computer. The sample tag code X is compared to
the pointer file 12 for a matching sequence tag code. If a match is found, a pointer associated with the matching
sequence tag code is used to access the corresponding sequence in the sequence database 10.

The pointer file 12 may be in any of several formats. In one format, each entry of the pointer file 12 comprises a
tag code and a pointer to a corresponding record in the sequence database 10. The sample tag code X can be compared
to sequence tag codes in a linear search. Alternatively, the sequence tag codes can be sorted and a binary search
used. As another alternative, the sequence tag codes can be structured in a hierarchical tree structure (e.g., a B-tree),
or as a singly or doubly linked list, or in any other conveniently searchable data structure or format.
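
The sorted-entries alternative can be illustrated with Python's standard `bisect` module. This is a minimal sketch, assuming each pointer-file entry is a `(tag_code, record_pointer)` pair; the function name `find_record` and the sample entries are hypothetical.

```python
from bisect import bisect_left


def find_record(sorted_entries, sample_code):
    """Binary-search (tag_code, record_pointer) pairs sorted by tag code.

    Returns the pointer of the first matching entry, or None if absent.
    """
    # (sample_code,) sorts just before any (sample_code, pointer) pair.
    i = bisect_left(sorted_entries, (sample_code,))
    if i < len(sorted_entries) and sorted_entries[i][0] == sample_code:
        return sorted_entries[i][1]
    return None


# Hypothetical pointer file: (tag code, record number) entries, sorted once.
entries = sorted([(1043, 7), (27, 2), (90000, 5)])
print(find_record(entries, 1043))
```

A linear search would examine every entry in the worst case, whereas the binary search needs on the order of log2(n) comparisons once the entries are sorted.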

In the preferred embodiment, each entry of the pointer file 12 comprises only a pointer to a corresponding record
in the sequence database 10. In building the pointer file 12, each sequence tag code is assigned to an entry position
in the pointer file 12 corresponding to the value of the tag code. For example, if a sequence tag code was "1043", a
pointer to the corresponding record in the sequence database 10 would be stored in entry #1043 of the pointer file 12.
The value of a sample tag code X can be used to directly address the location in the pointer file 12 that corresponds
to the sample tag code X, and thus rapidly access the pointer stored in that location in order to address the sequence
database 10.
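
A directly addressed pointer file can be sketched as a plain list whose index is the tag code. The sketch below assumes 9-base tags encoded in base 4 (so 4^9 slots) and uses a hypothetical sentinel `NO_RECORD`; neither the names nor the record number 12 come from the specification.

```python
N = 9                       # tag length used in the preferred embodiment
TABLE_SIZE = 4 ** N         # one slot per possible 9-base tag (262,144)
NO_RECORD = -1              # hypothetical marker for "no sequence has this code"


def build_direct_table(code_to_record):
    """Store each record pointer at the index equal to its tag code."""
    table = [NO_RECORD] * TABLE_SIZE
    for code, record in code_to_record.items():
        table[code] = record
    return table


def lookup(table, sample_code):
    """Address the table directly with the sample tag code (no searching)."""
    record = table[sample_code]
    return None if record == NO_RECORD else record


# Hypothetical example mirroring the text: tag code 1043 points at record 12.
table = build_direct_table({1043: 12})
print(lookup(table, 1043), lookup(table, 1044))
```

The lookup cost does not depend on the number of database sequences, at the price of reserving a slot for every possible tag code.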

Because only four values are needed to represent all possible base pairs, using binary coded decimal (BCD)
numbers for tag codes in conjunction with the preferred pointer file 12 structure leads to a "sparse" pointer file 12 that
wastes memory or storage space. Accordingly, the present invention transforms each tag code to number base 4 (i.e.,
2 bits per code digit) in known fashion, resulting in a compact pointer file 12 structure. For example, for tag sequence
"ACGT", with A=00₂, C=01₂, G=10₂, T=11₂, the base four representation in binary would be "00011011". In contrast,
the BCD representation would be "00000000 00000001 00000010 00000011". Of course, it should be understood
that other mappings of base pairs to codes would provide equivalent function.
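
The space saving can be made concrete with a short Python sketch. The function names are hypothetical, and the one-byte-per-digit form simply mirrors the BCD illustration quoted in the text; the bit strings quoted there are what this sketch produces for the base order A, C, G, T.

```python
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_two_bit(tag):
    """Concatenate the 2-bit codes of each base."""
    return "".join(format(TWO_BIT[b], "02b") for b in tag.upper())


def pack_one_byte_per_digit(tag):
    """One 8-bit group per code digit, as in the BCD illustration."""
    return " ".join(format(TWO_BIT[b], "08b") for b in tag.upper())


print(pack_two_bit("ACGT"))              # 00011011
print(pack_one_byte_per_digit("ACGT"))   # 00000000 00000001 00000010 00000011
print(len(pack_two_bit("GAGCACACC")))    # 18 bits for a 9-base tag
```

With 2 bits per base a 9-base tag fits in 18 bits, so every possible tag code falls in the range 0 to 262,143 and no table slot is wasted on unused digit values.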

The concept of deriving a defined tag from a sample sequence in accordance with the present invention is also
useful in comparing different samples for similarity. In the preferred embodiment, a computer method is used to match
sequence tags from different samples. For example, in comparing materials having a large number of sequences
(e.g., tissue), the frequency of occurrence of the various tags in a first sample can be mapped out as tag codes stored
in a distribution or histogram-type data structure. For example, a table structured similar to pointer file 12 in FIGURE
4 can be used where each entry comprises a frequency of occurrence value. Thereafter, the various tags in a second
sample can be generated, converted to tag codes, and compared to the table by directly addressing table entries with
the tag code. A count can be kept of the number of matches found, as well as the location of the matches, for output
in text or graphic form on an output device, and/or for storage in a data storage system for later use.
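
The sample-to-sample comparison can be sketched as follows. This is an illustration only: `collections.Counter` stands in for the directly addressed frequency table, and the function names and the two tag lists are hypothetical.

```python
from collections import Counter


def tag_code(tag):
    """Base-4 tag code with A=0, C=1, G=2, T=3 (first base most significant)."""
    code = 0
    for base in tag.upper():
        code = code * 4 + "ACGT".index(base)
    return code


def compare_samples(first_tags, second_tags):
    """Return {tag_code: (count_in_first, count_in_second)} for shared tags."""
    first = Counter(tag_code(t) for t in first_tags)
    second = Counter(tag_code(t) for t in second_tags)
    return {c: (first[c], second[c]) for c in first if c in second}


# Hypothetical tag lists from two samples.
sample_1 = ["GAGCACACC", "GAGCACACC", "TCCTCAAAA", "AAGGTAACA"]
sample_2 = ["GAGCACACC", "TCCTCAAAA", "TCCTCAAAA", "CCCATCGTC"]
matches = compare_samples(sample_1, sample_2)
for code, (n1, n2) in sorted(matches.items()):
    print(code, n1, n2)
```

The per-code count pairs are the raw material for the text or graphic output mentioned above; tags present in only one sample are simply absent from the result.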

The tag comparison aspects of the invention may be implemented in hardware or software, or a combination of
both. Preferably, these aspects of the invention are implemented in computer programs executing on a programmable
computer comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage
elements), at least one input device, and at least one output device. Data input through one or more input devices for
temporary or permanent storage in the data storage system includes sequences, and may include previously generated
tags and tag codes for known and/or unknown sequences. Program code is applied to the input data to perform the
functions described above and generate output information. The output information is applied to one or more output
devices, in known fashion.

Each such computer program is preferably stored on a storage media or device (e.g., ROM or magnetic diskette)
readable by a general or special purpose programmable computer, for configuring and operating the computer when
the storage media or device is read by the computer to perform the procedures described herein. The inventive system
may also be considered to be implemented as a computer-readable storage medium, configured with a computer
program, where the storage medium so configured causes a computer to operate in a specific and predefined manner
to perform the functions described herein.

The following examples are intended to illustrate but not limit the invention. While they are typical of those that
might be used, other procedures known to those skilled in the art may alternatively be used.

For exemplary purposes, the SAGE method of the invention was used to characterize gene expression in the
human pancreas. NlaIII was utilized as the first restriction endonuclease, or anchoring enzyme, and BsmFI as the
second restriction endonuclease, or tagging enzyme, yielding a 9 bp tag. (BsmFI was predicted to cleave the
complementary strand 14 bp 3' to the recognition site GGGAC and to yield a 4 bp 5' overhang (New England BioLabs).
Overlapping the BsmFI and NlaIII (CATG) sites as indicated (GGGACATG) would be predicted to result in an 11 bp tag.
However, analysis suggested that under the cleavage conditions used (37°C) BsmFI often cleaved closer to its
recognition site, leaving a minimum of 12 bp 3' of its recognition site. Therefore, only the 9 bp closest to the anchoring
enzyme site was used for analysis of tags. Cleavage at 65°C results in a more consistent 11 bp tag.)

Computer analysis of human transcripts from GenBank indicated that greater than 95% of tags of 9 bp in length
were likely to be unique and that inclusion of two additional bases provided little additional resolution. Human sequences
(84,300) were extracted from the GenBank 87 database using the Findseq program provided on the IntelliGenetics
Bionet on-line service. All further analysis was performed with a SAGE program group written in Microsoft Visual Basic
for the Microsoft Windows operating system. The SAGE database analysis program was set to include only sequences
noted as "RNA" in the locus description and to exclude entries noted as "EST", resulting in a reduction to 13,241
sequences. Analysis of this subset of sequences using NlaIII as anchoring enzyme indicated that 4,127 nine bp tags
were unique while 1,511 tags were found in more than one entry. Nucleotide comparison of a randomly chosen subset
(100) of the latter entries indicated that at least 83% were due to redundant data base entries for the same gene or
EXAMPLE 1

As outlined above, mRNA from human pancreas was used to generate ditags. Briefly, five µg mRNA from total
pancreas (Clontech) was converted to double-stranded cDNA using a BRL cDNA synthesis kit following the
manufacturer's protocol, using the primer biotin-5'T18-3'. The cDNA was then cleaved with NlaIII and the 3' restriction
fragments isolated by binding to magnetic streptavidin beads (Dynal). The bound DNA was divided into two pools and
one of the following linkers ligated to each pool:

5'-TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG-3'
3'-ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT-5'
(SEQ ID NO:1 and 2)

5'-TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG-3'
3'-AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT-5'
(SEQ ID NO:3 and 4),

where A is a dideoxy nucleotide (e.g., dideoxy A).

After extensive washing to remove unligated linkers, the linkers and adjacent tags were released by cleavage with
BsmFI. The resulting overhangs were filled in with T4 polymerase and the pools combined and ligated to each other.
The desired ligation product was then amplified for 25 cycles using 5'-CCAGCTTATTCAATTCGGTCC-3' and
5'-GTAGACATTCTAGTATCTCGT-3' (SEQ ID NO:5 and 6, respectively) as primers. The PCR reaction was then analyzed by
polyacrylamide gel electrophoresis and the desired product excised. An additional 15 cycles of PCR were then
performed to generate sufficient product for efficient ligation and cloning.

The PCR ditag products were cleaved with NlaIII and the band containing the ditags was excised and self-ligated.
After ligation, the concatenated ditags were separated by polyacrylamide gel electrophoresis and products greater
than 200 bp were excised. These products were cloned into the SphI site of pSL301 (Invitrogen). Colonies were
screened for inserts by PCR using T7 and T3 sequences outside the cloning site as primers. Clones containing at least
10 tags (range 10 to 50 tags) were identified by PCR amplification and manually sequenced as described (Del Sal et
al., Biotechniques 7:514, 1989) using 5'-GACGTCGACCTGAGGTAATTATAACC-3' (SEQ ID NO:7) as primer.
Sequence files were analyzed using the SAGE software group, which identifies the anchoring enzyme site with the proper
spacing and extracts the two intervening tags and records them in a database. The 1,000 tags were derived from 413
unique ditags and 87 repeated ditags. The latter were only counted once to eliminate potential PCR bias of the
quantitation. The function of SAGE software is merely to optimize the search for gene sequences.
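
The tag-extraction step performed on the sequence files can be sketched as below. This is not the SAGE software group itself (which the text says was written in Visual Basic); it is a hedged Python illustration in which the function names, the accepted ditag length window (18 to 26 bp) and the example concatemer are all assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_tags(concatemer, anchor="CATG", tag_len=9, min_ditag=18, max_ditag=26):
    """Extract tags from a concatemer of ditags separated by anchoring sites.

    For each ditag, the first tag is the tag_len bases after the upstream
    site; the second is the reverse complement of the tag_len bases before
    the downstream site. Repeated ditags are counted only once.
    """
    pieces = concatemer.upper().split(anchor)[1:-1]    # text between sites
    seen, tags = set(), []
    for ditag in pieces:
        if not (min_ditag <= len(ditag) <= max_ditag):
            continue                                   # improper spacing
        key = min(ditag, reverse_complement(ditag))    # orientation-independent
        if key in seen:
            continue                                   # repeated ditag
        seen.add(key)
        tags.append(ditag[:tag_len])
        tags.append(reverse_complement(ditag[-tag_len:]))
    return tags


# Hypothetical concatemer: ditag_1 occurs twice and is counted once.
ditag_1 = "GAGCACACC" + "AA" + reverse_complement("TCCTCAAAA")
ditag_2 = "AAGGTAACA" + reverse_complement("CCCATCGTC")
concatemer = "TTCATG" + ditag_1 + "CATG" + ditag_2 + "CATG" + ditag_1 + "CATGGG"
print(extract_tags(concatemer))
```

Only the 9 bp nearest each anchoring site are kept, matching the analysis choice stated earlier; a ditag that happened to contain an internal CATG would be split by this simple approach and rejected by the length check.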

Table 1 shows analysis of the first 1,000 tags. Sixteen percent were eliminated because they either had sequence
ambiguities or were derived from linker sequences. The remaining 840 tags included 351 tags that occurred once and
77 tags that were found multiple times. Nine of the ten most abundant tags matched at least one entry in GenBank
R87. The remaining tag was subsequently shown to be derived from amylase. All ten transcripts were derived from
genes of known pancreatic function and their prevalence was consistent with previous analyses of pancreatic RNA
using conventional approaches (Han et al., Proc. Natl. Acad. Sci. USA 83:110, 1986; Takeda et al., Hum. Mol. Gen.
2:1793, 1993).

TABLE 1

Pancreatic SAGE Tags

Tag          Gene                                                    N     Percent

GAGCACACC    Procarboxypeptidase A1 (accession illegible)            64    7.6
TTCTGTGTG    Pancreatic Trypsinogen 2 (accession illegible)          46    5.5
GAACACAAA    Chymotrypsinogen (M24400)                               37    4.4
TCAGGGTGA    Pancreatic Trypsin 1 (accession illegible)              31    3.7
GCGTGACCA    Elastase IIIB (accession illegible)                     20    2.4
GTGTGTGCT    [illegible]                                             16    1.9
[illegible]  [illegible]                                             16    1.9
CCAGAGAGT    [illegible]                                             14    1.7
TCCTCAAAA    No Match, See Table 2, P1                               14    1.7
AGCCTTGGT    Bile Salt Stimulated Lipase (accession illegible)       12    1.4
GTGTGCGCT    No Match                                                11    1.3
TGCGAGACC    No Match, See Table 2, P2                                9    1.1
GTGAAACCC    21 Alu entries                                           8    1.0
GGTGACTCT    [illegible]                                              8    1.0
AAGGTAACA    Secretory Trypsin Inhibitor (M11949)                     6    0.7
TCCCCTGTG    No Match                                                 5    0.6
GTGACCACG    No Match                                                 5    0.6
CCTGTAATC    [accessions illegible], 11 Alu entries                   5    0.6
CACGTTGGA    [illegible]                                              5    0.6
AGCCCTACA    No Match                                                 5    0.6
AGCACCTCC    Elongation Factor 2 (Z11692)                             5    0.6
ACGCAGGGA    No Match, See Table 2, P3                                5    0.6
AATTGAAGA    No Match, See Table 2, P4                                5    0.6
TTCTGTGGG    No Match                                                 4    0.5
TTCATACAC    No Match                                                 4    0.5
GTGGCAGGC    NF-kB (X61499), Alu entry (S94541)                       4    0.5
GTAAAACCC    TNF receptor II (M55994), Alu entry (X01448)             4    0.5
GAACACACA    No Match                                                 4    0.5
CCTGGGAAG    Pancreatic Mucin (J05582)                                4    0.5
CCCATCGTC    Mitochondrial CytC Oxidase (X15759)                      4    0.5
(SEQ ID NO:8-37)

Summary

SAGE tags    Greater than three times                               380   45.2
Occurring    Three times (15 x 3 =)                                  45    5.4
             Two times (32 x 2 =)                                    64    7.6
             One time                                               351   41.8

Total SAGE Tags                                                     840  100.0

"Tag" indicates the 9 bp sequence unique to each tag, adjacent to the 4 bp anchoring NlaIII site. "N" and "Percent"
indicate the number of times the tag was identified and its frequency, respectively. "Gene" indicates the accession
number and description of GenBank R87 entries found to match the indicated tag using the SAGE software group, with
the following exceptions. When multiple entries were identified because of duplicated entries, only one entry is listed.
In the cases of chymotrypsinogen and trypsinogen 1, other genes were identified that were predicted to contain the
same tags, but subsequent hybridization and sequence analysis identified the listed genes as the source of the tags.
"Alu entry" indicates a match with a GenBank entry for a transcript that contained at least one copy of the Alu consensus
sequence (Deininger et al., J. Mol. Biol. 151:17, 1981).

EXAMPLE 2

The quantitative nature of SAGE was evaluated by construction of an oligo-dT primed pancreatic cDNA library
which was screened with cDNA probes for trypsinogen 1/2, procarboxypeptidase A1, chymotrypsinogen and elastase
IIIB/protease E. Pancreatic mRNA from the same preparation as used for SAGE in Example 1 was used to construct
a cDNA library in the ZAP Express vector using the ZAP Express cDNA Synthesis kit following the manufacturer's
protocol (Stratagene). Analysis of 15 randomly selected clones indicated that 100% contained cDNA inserts. Plates
containing 250 to 500 plaques were hybridized as previously described (Ruppert et al., Mol. Cell. Biol. 8:3104, 1988).
cDNA probes for trypsinogen 1, trypsinogen 2, procarboxypeptidase A1, chymotrypsinogen and elastase IIIB were
derived by RT-PCR from pancreas RNA. The trypsinogen 1 and 2 probes were 93% identical and hybridized to the
same plaques under the conditions used. Likewise, the elastase IIIB probe and protease E probe were over 95%
identical and hybridized to the same plaques.

The relative abundance of the SAGE tags for these transcripts was in excellent agreement with the results obtained
with library screening (Figure 2). Furthermore, whereas neither trypsinogen 1 and 2 nor elastase IIIB and protease E
could be distinguished by the cDNA probes used to screen the library, all four transcripts could readily be distinguished
on the basis of their SAGE tags (Table 1).

EXAMPLE 3 

In addition to providing quantitative information on the abundance of known transcripts, SAGE could be used to
identify novel expressed genes. While for the purposes of the SAGE analysis in this example only the 9 bp sequence
unique to each transcript was considered, each SAGE tag defined a 13 bp sequence composed of the anchoring
enzyme (4 bp) site plus the 9 bp tag. To illustrate this potential, 13 bp oligonucleotides were used to isolate the transcripts
corresponding to four unassigned tags (P1 to P4), that is, tags without corresponding entries from GenBank R87 (Table
1). In each of the four cases, it was possible to isolate multiple cDNA clones for the tag by simply screening the
pancreatic cDNA library using the 13 bp oligonucleotide as hybridization probe (examples in Figure 3).

Plates containing 250 to 2,000 plaques were hybridized to oligonucleotide probes using the same conditions
previously described for standard probes, except that the hybridization temperature was reduced to room temperature.
Washes were performed in 6xSSC/0.1% SDS for 30 minutes at room temperature. The probes consisted of 13 bp
oligonucleotides which were labeled with γ-32P-ATP using T4 polynucleotide kinase. In each case, sequencing of the
derived clones identified the correct SAGE tag at the predicted 3' end of the identified transcript. The abundance of
plaques identified by hybridization with the 13-mers was in good agreement with that predicted by SAGE (Table 2).
Tags P1 and P2 were found to correspond to amylase and procarboxypeptidase A2, respectively. No entry for
preprocarboxypeptidase A2 and only a truncated entry for amylase was present in GenBank R87, thus accounting for
their unassigned characterization. Tag P3 did not match any genes of known function in GenBank but did match
numerous ESTs, providing further evidence that it represented a bona fide transcript. The cDNA identified by P4 showed
no significant homology, suggesting that it represented a previously uncharacterized pancreatic transcript.

TABLE 2

Characterization of Unassigned SAGE Tags

Tag                             SAGE        13mer Hyb          SAGE   Description
                                Abundance                      Tag

P1  TCCTCAAAA (SEQ ID NO:38)    1.7%        1.5% (6/388)       +      3' end of Pancreatic Amylase (M28443)
P2  TGCGAGACC (SEQ ID NO:39)    1.1%        1.2% (43/3700)     +      3' end of Preprocarboxypeptidase A2 (U19977)
P3  ACGCAGGGA (SEQ ID NO:40)    0.6%        0.2% (5/2772)      +      EST match (R45808)
P4  AATTGAAGA (SEQ ID NO:41)    0.6%        0.4% (6/1587)      +      no match

"Tag" and "SAGE Abundance" are described in Table 1. "13mer Hyb" indicates the results obtained by screening
a cDNA library with a 13mer as described above. The number of positive plaques divided by the total plaques screened
is indicated in parentheses following the percent abundance. A positive in the "SAGE Tag" column indicates that the
expected SAGE tag sequence was identified near the 3' end of isolated clones. "Description" indicates the results of
BLAST searches of the daily updated GenBank entries at NCBI as of 6/9/95 (Altschul et al., J. Mol. Biol. 215:403,
1990). A description and Accession number are given for the most significant matches. P1 was found to match a
truncated entry for amylase, and P2 was found to match an unpublished entry for preprocarboxypeptidase A2 which
was entered after GenBank R87.
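
The agreement between the two abundance estimates can be checked with simple arithmetic. The sketch below assumes the plaque counts shown in Table 2 and the tag counts and 840-tag total from Table 1 (the count for P3 is inferred from its 0.6% abundance); the variable and function names are hypothetical.

```python
plaque_screens = {          # tag: (positive plaques, total plaques screened)
    "P1": (6, 388),
    "P2": (43, 3700),
    "P3": (5, 2772),
    "P4": (6, 1587),
}
sage_counts = {"P1": 14, "P2": 9, "P3": 5, "P4": 5}   # N out of 840 SAGE tags
TOTAL_SAGE_TAGS = 840


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


for tag, (positive, screened) in plaque_screens.items():
    sage = percent(sage_counts[tag], TOTAL_SAGE_TAGS)
    hyb = percent(positive, screened)
    print(tag, sage, hyb)
```

Each line prints the SAGE abundance followed by the 13mer hybridization abundance for one tag, which makes the two columns of the table easy to compare side by side.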

EXAMPLE 4 

Ditags produced by SAGE can be analyzed by PSA or CS, as described in the specification. In a preferred
embodiment of PSA, the following steps are carried out with ditags:

Ditags are prepared, amplified and cleaved with the anchoring enzyme as described in the previous examples:



    5'-OOOOOOOOOOXXXXXXXXXXCATG-3'
3'-GTACOOOOOOOOOOXXXXXXXXXX-5'



Four-base oligomers containing an identifier (e.g., a fluorescent moiety, FL) are prepared that are complementary to
the overhangs, for example, FL-CATG. The FL-CATG oligomers (in excess) are ligated to the ditags as shown below:

5'-FL-CATGOOOOOOOOOOXXXXXXXXXXCATG
      GTACOOOOOOOOOOXXXXXXXXXXGTAC-FL-5'

The ditags are then purified and melted to yield single-stranded DNAs having the formula:

5'-FL-CATGOOOOOOOOOOXXXXXXXXXXCATG

and

GTACOOOOOOOOOOXXXXXXXXXXGTAC-FL-5',

for example. The mixture of single-stranded DNAs is preferably serially diluted. Each serial dilution is hybridized under
appropriate stringency conditions with solid matrices containing gridded single-stranded oligonucleotides; all of the
oligonucleotides contain a half-site of the anchoring enzyme cleavage sequence. In the example used herein, the
oligonucleotide sequences contain a CATG sequence at the 5' end:

CATGOOOOOOOOOO, CATGXXXXXXXXXX, etc.

(or alternatively a CATG sequence at the 3' end: OOOOOOOOOOCATG).

The matrices can be constructed of any material known in the art, and the oligonucleotide-bearing chips can be
generated by any procedure known in the art, e.g., silicon chips containing oligonucleotides prepared by the VLSIP
procedure (Fodor et al., supra).

The oligonucleotide-bearing matrices are evaluated for the presence or absence of a fluorescent ditag at each
position in the grid.

In a preferred embodiment, there are 4^10, or 1,048,576, oligonucleotides on the grid(s) of the general sequence
CATGOOOOOOOOOO, such that every possible 10-base sequence is represented 3' to the CATG, where CATG is
used as an example of an anchoring enzyme half site that is complementary to the anchoring enzyme half site at the
3' end of the ditag. Since there are estimated to be no more than 100,000 to 200,000 different expressed genes in the
human genome, there are enough oligonucleotide sequences to detect all of the possible sequences adjacent to the
3'-most anchoring enzyme site observed in the cDNAs from the expressed genes in the human genome.
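
The size of the grid follows directly from the four-letter alphabet, as the following Python sketch shows. The lexicographic ordering of probes and the names `grid_oligos` and `grid_index` are assumptions chosen for illustration; the specification does not prescribe how probes are arranged on the grid.

```python
from itertools import product

ANCHOR_HALF_SITE = "CATG"
TAG_LEN = 10

print(4 ** TAG_LEN)   # 1048576 possible 10-base sequences


def grid_oligos(tag_len=TAG_LEN):
    """Lazily yield every anchor + tag_len-mer probe in lexicographic order."""
    for bases in product("ACGT", repeat=tag_len):
        yield ANCHOR_HALF_SITE + "".join(bases)


def grid_index(probe):
    """Position of a probe in that ordering (base-4 value of its tag part)."""
    index = 0
    for base in probe[len(ANCHOR_HALF_SITE):]:
        index = index * 4 + "ACGT".index(base)
    return index


first_three = [p for _, p in zip(range(3), grid_oligos())]
print(first_three)
```

With roughly one million probes and at most 100,000 to 200,000 expressed genes, the probe set is several times larger than the number of distinct tags it needs to detect.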

In yet another embodiment, structures as described above containing the sequences

PRIMER A-GGAGCATG(X)10(O)10CATGCATCC-PRIMER B
PRIMER A-CCTCGTAC(X)10(O)10GTACGTAGG-PRIMER B

are amplified, cleaved with tagging enzyme and thereafter with anchoring enzyme to generate tag complements of the
structure (O)10CATG-3', which can then be labeled, melted and hybridized with oligonucleotides on a solid support.

A determination is made of differential expression by comparing the fluorescence profile on the grids at different
dilutions among different libraries (representing differential screening probes). For example:

Library A, Ditags Diluted 1:10

      A    B    C    D    E
 1    FL   -    -    -    -
 2    -    -    -    -    FL
 3    -    FL   FL   -    -
 4    -    -    -    FL   -
 5    FL   -    -    -    -

Library B, Ditags Diluted 1:10

      A    B    C    D    E
 1    FL   -    -    -    -
 2    -    -    FL   -    FL
 3    -    FL   FL   -    -
 4    -    -    -    -    -
 5    FL   -    -    -    FL

Library A, Ditags Diluted 1:50

      A    B    C    D    E
 1    FL   -    -    -    -
 2    -    -    -    -    -
 3    -    FL   -    -    -
 4    -    -    -    FL   -
 5    FL   -    -    -    -

Library A, Ditags Diluted 1:100

      A    B    C    D    E
 1    FL   -    -    -    -
 2    -    -    -    -    -
 3    -    FL   -    -    -
 4    -    -    -    FL   -
 5    FL   -    -    -    -

Library B, Ditags Diluted 1:50

      A    B    C    D    E
 1    FL   -    -    -    -
 2    -    -    FL   -    -
 3    -    FL   FL   -    -
 4    -    -    -    -    -
 5    -    -    -    -    -

Library B, Ditags Diluted 1:100

      A    B    C    D    E
 1    FL   -    -    -    -
 2    -    -    FL   -    -
 3    -    FL   -    -    -
 4    -    -    -    -    -
 5    -    -    -    -    -
The individual oligonucleotides thus hybridize to ditags with the following characteristics:

Table 3

Dilution       1:10             1:50             1:100
           Lib A   Lib B    Lib A   Lib B    Lib A   Lib B

1A           +       +        +       +        +       +
2C           -       +        -       +        -       +
2E           +       +        -       -        -       -
3B           +       +        +       +        +       +
3C           +       +        -       +        -       -
4D           +       -        +       -        +       -
5A           +       +        +       -        +       -
5E           -       +        -       -        -       -

Table 3 summarizes the results of the differential hybridization. Tags hybridizing to 1A and 3B reflect highly
abundant mRNAs that are not differentially expressed (since the tags hybridize to both libraries at all dilutions); tag 2C
identifies a highly abundant mRNA, but only in Library B. 2E reflects a low abundance transcript (since it is only detected
at the lowest dilution) that is not found to be differentially expressed; 3C reflects a moderately abundant transcript
(since it is expressed at the lower two dilutions) in Library B that is expressed at low abundance in Library A. 4D reflects
a differentially-expressed, high abundance transcript restricted to Library A; 5A reflects a transcript that is expressed
at high abundance in Library A but only at low abundance in Library B; and 5E reflects a differentially-expressed
transcript that is detectable only in Library B.
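
The reasoning used to read the hybridization grids can be expressed as a small scoring rule. The sketch below is an assumption made for illustration, not a procedure stated in the specification: abundance is scored by the number of dilutions (1:10, 1:50, 1:100) at which a signal is still seen, and a position is called differential when the two libraries score differently.

```python
DILUTIONS = ("1:10", "1:50", "1:100")   # least to most dilute


def summarize(signal_a, signal_b):
    """Summarise one grid position.

    signal_a / signal_b are tuples of booleans, one per entry in DILUTIONS,
    for Library A and Library B respectively.
    """
    level_a, level_b = sum(signal_a), sum(signal_b)
    names = {0: "not detected", 1: "low", 2: "moderate", 3: "high"}
    return {
        "A": names[level_a],
        "B": names[level_b],
        "differential": level_a != level_b,
    }


# Signals as read from the grids for three positions.
print(summarize((True, True, True), (True, True, True)))       # position 1A
print(summarize((True, False, False), (True, True, False)))    # position 3C
print(summarize((True, True, True), (False, False, False)))    # position 4D
```

Under this rule a signal that persists at 1:100 is called high abundance, one lost after 1:50 moderate, and one seen only at 1:10 low, which reproduces the qualitative calls made in the summary paragraph.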

In another PSA embodiment, step 3 above does not involve the use of a fluorescent or other identifier; instead, at
the last round of amplification of the ditags, labeled dNTPs are used so that after melting, half of all molecules are
labeled and can serve as probes for hybridization to oligonucleotides fixed on the chips.

In yet another PSA embodiment, instead of ditags, a particular portion of the transcript is used, e.g., the sequence
between the 3' terminus of the transcript and the first anchoring enzyme site. In that particular case, a double-stranded
cDNA reverse transcript is generated as described in the Detailed Description. The transcripts are cut with the anchoring
enzyme, a linker is added containing a PCR primer, and amplification is initiated (using the primer at one end and the
poly A tail at the other) while the transcripts are still on the streptavidin bead. At the last round of amplification,
fluoresceinated dNTPs are used so that half of the molecules are labeled. The linker-primer can be optionally removed by
use of the anchoring enzyme at this point in order to reduce the size of the fragments. The soluble fragments are then
melted and captured on solid matrices containing CATGOOOOOOOOOO, as in the previous example. Analysis and
scoring (only of the half of the fragments which contain fluoresceinated bases) is as described above.

For use in clonal sequencing, ditags or concatemers would be diluted and added to wells of multiwell plates, for
example, or other receptacles so that on average the wells would contain, statistically, less than one DNA molecule
per well (as is done in limiting dilution for cell cloning). Each well would then receive reagents for PCR or another
amplification process, and the DNA in each receptacle would be sequenced, e.g., by mass spectroscopy. The results
will either be a single sequence (there having been a single sequence in that receptacle), a "null" sequence (no DNA
present) or a double sequence (more than one DNA molecule), which would be eliminated from consideration during
data analysis. Thereafter, assessment of differential expression would be the same as described herein.
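
The three outcomes named above (single, "null" and double sequences) can be estimated with a Poisson model, which is a common assumption for limiting dilution but is not stated in the specification. The function name and the example mean values are hypothetical.

```python
from math import exp


def well_fractions(mean_molecules_per_well):
    """Poisson estimate of well outcomes at limiting dilution.

    Returns the expected fractions of wells with no DNA ("null"),
    exactly one molecule (usable) and more than one molecule (discarded).
    """
    lam = mean_molecules_per_well
    p_null = exp(-lam)
    p_single = lam * exp(-lam)
    p_multiple = 1.0 - p_null - p_single
    return p_null, p_single, p_multiple


for lam in (0.1, 0.3, 0.5, 1.0):
    null, single, multiple = well_fractions(lam)
    print(f"{lam:.1f}  null={null:.3f}  single={single:.3f}  multiple={multiple:.3f}")
```

Under this model, lowering the mean number of molecules per well reduces the fraction of wells that must be discarded as multiples, at the cost of more empty wells.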

These results demonstrate that SAGE provides both quantitative and qualitative data about gene expression. The
use of different anchoring enzymes and/or tagging enzymes with various recognition elements lends great flexibility
to this strategy. In particular, since different anchoring enzymes cleave cDNA at different sites, the use of at least 2
different AEs on different samples of the same cDNA preparation allows confirmation of results and analysis of
sequences that might not contain a recognition site for one of the enzymes.

As efforts to fully characterize the genome near completion, SAGE should allow a direct readout of expression in
any given cell type or tissue. In the interim, a major application of SAGE will be the comparison of gene expression
patterns among tissues and in various developmental and disease states in a given cell or tissue. One of skill in the
art with the capability to perform PCR and manual sequencing could perform SAGE for this purpose. Adaptation of
this technique to an automated sequencer would allow the analysis of over 1,000 transcripts in a single 3 hour run. An
ABI 377 sequencer can produce a 451 bp readout for 36 templates in a 3 hour run (451 bp / 11 bp per tag x 36 = 1476
tags). The appropriate number of tags to be determined will depend on the application. For example, the definition of
genes expressed at relatively high levels (0.5% or more) in one tissue, but low in another, would require only a single
day. Determination of transcripts expressed at greater than 100 mRNAs per cell (0.025% or more) should be quantifiable
within a few months by a single investigator. Use of two different Anchoring Enzymes will ensure that virtually all
transcripts of the desired abundance will be identified. The genes encoding those tags found to be most interesting on
the basis of their differential representation can be positively identified by a combination of data-base searching,
hybridization, and sequence analysis, as demonstrated in Table 2. Obviously, SAGE could also be applied to the analysis
of organisms other than humans, and could direct investigation towards genes expressed in specific biologic states.
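
The throughput figure quoted in the text can be reproduced with a short calculation. The function names are hypothetical; the default values are the ones given above (451 bp read, 11 bp per tag, 36 templates per run).

```python
def tags_per_run(read_length_bp=451, bp_per_tag=11, templates=36):
    """Whole tags readable in one sequencer run, using the figures in the text."""
    return (read_length_bp // bp_per_tag) * templates


def expected_count(total_tags, abundance_fraction):
    """Expected number of times a transcript's tag is seen among total_tags."""
    return total_tags * abundance_fraction


print(tags_per_run())                          # 1476
print(expected_count(tags_per_run(), 0.005))   # expected sightings of a 0.5% transcript
```

The second figure gives a rough sense of why a transcript at 0.5% abundance can be characterized from a small number of runs, whereas one at 0.025% needs many more tags to be observed repeatedly.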

SAGE, as described herein, allows comparison of expression of numerous genes among tissues or among different
states of development of the same tissue, or between pathologic tissue and its normal counterpart. Such analysis is
useful for identifying therapeutically, diagnostically and prognostically relevant genes, for example. Among the many
utilities for SAGE technology is the identification of appropriate antisense or triple helix reagents which may be
therapeutically useful. Further, gene therapy candidates can also be identified by the SAGE technology. Other uses include
diagnostic applications for identification of individual genes or groups of genes whose expression is shown to correlate
to predisposition to disease, the presence of disease, and prognosis of disease, for example. An abundance profile,
such as that depicted in Table 1, is useful for the above described applications. SAGE is also useful for detection of
an organism (e.g., a pathogen) in a host or detection of infection-specific genes expressed by a pathogen in a host.

The ability to identify a large number of expressed genes in a short period of time, as described by SAGE in the
present invention, provides unlimited uses.

Although the invention has been described with reference to the presently preferred embodiment, it should be
understood that various modifications can be made without departing from the spirit of the invention. Accordingly, the
invention is limited only by the following claims.



Claims 



1. An isolated oligonucleotide composition having at least two defined nucleotide sequence tags, wherein at least
one tag corresponds to at least one expressed gene.

2. The composition of claim 1, wherein the oligonucleotide consists of about 1 to 200 ditags.

3. The composition of claim 2, wherein the oligonucleotide consists of about 8 to 20 ditags.



4. A method for the detection of gene expression comprising:

producing complementary deoxyribonucleic acid (cDNA) oligonucleotides;

isolating a first defined nucleotide sequence tag from a first cDNA oligonucleotide and a second defined
nucleotide sequence tag from a second cDNA oligonucleotide;

linking the first tag to a first oligonucleotide linker, wherein the first oligonucleotide linker comprises a first
sequence for hybridization of an amplification primer, and linking the second tag to a second oligonucleotide
linker, wherein the second oligonucleotide linker comprises a second sequence for hybridization of an
amplification primer; and

determining the nucleotide sequence of the tag(s), wherein the tag(s) correspond to an expressed gene.

5. The method of claim 4 further comprising ligating the first tag linked to the first oligonucleotide linker to the second 
tag linked to the second oligonucleotide linker and forming a ditag. 

6. The method of claim 5 further comprising amplifying the ditag oligonucleotide.

7. The method of claim 5 further comprising producing concatemers of the ditags.

8. The method of claim 7 wherein the concatemer consists of about 2 to 200 ditags.

9. The method of claim 8, wherein the concatemer consists of about 8 to 20 ditags.

10. The method of claim 4 wherein the first and second oligonucleotide linkers comprise the same nucleotide sequence.

11. The method of claim 4 wherein the first and second oligonucleotide linkers comprise different nucleotide sequences.

12. The method of claim 11, wherein the first and second oligonucleotide linkers have a sequence

5'- TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG -3'
3'- ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT -5'

5'- TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG -3'
3'- AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT -5',

wherein A is dideoxy A.
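The strand pairing of the two linkers recited in claim 12 can be checked computationally. The following Python sketch is illustrative only and forms no part of the claimed method. The helper name `duplex_overhangs` is hypothetical, the lower strand of each linker is assumed to be written 3' to 5' as printed, and the dideoxy modification is ignored. Under those assumptions the sketch reports which bases of the upper strand remain unpaired at each end of the duplex.

```python
# Linker strands as recited in claim 12.
# Upper strands are written 5'->3'; lower strands are written 3'->5' (assumption).
LINKER_1_TOP = "TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG"
LINKER_1_BOTTOM = "ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT"
LINKER_2_TOP = "TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG"
LINKER_2_BOTTOM = "AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Return the base-by-base complement without reversing the sequence."""
    return seq.translate(_COMPLEMENT)


def duplex_overhangs(top, bottom_3to5):
    """Hypothetical helper: return (5' unpaired, 3' unpaired) bases of the upper strand.

    Because the lower strand is written 3'->5', its plain complement is the
    stretch of the upper strand that it pairs with.
    """
    paired = complement(bottom_3to5)
    start = top.find(paired)
    if start < 0:
        raise ValueError("strands are not complementary")
    return top[:start], top[start + len(paired):]


if __name__ == "__main__":
    print(duplex_overhangs(LINKER_1_TOP, LINKER_1_BOTTOM))
    print(duplex_overhangs(LINKER_2_TOP, LINKER_2_BOTTOM))
```

For both linkers the unpaired 3' end of the upper strand is CATG, which is the recognition sequence of NlaIII, the endonuclease named in claim 24.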

13. The method of claim 4 wherein the linkers comprise a second restriction endonuclease recognition site which
allows cleavage at a site distant from the recognition site.

14. The method of claim 13 wherein the second restriction endonuclease is a type IIS endonuclease.

15. The method of claim 14 wherein the type IIS endonuclease is selected from the group consisting of BsmFI and FokI.

16. The method of claim 5 wherein the ditag is about 12 to 60 base pairs.

17. The method of claim 16 wherein the ditag is about 15 to 22 base pairs.

18. The method of claim 6 wherein the amplifying is by polymerase chain reaction (PCR).

19. The method of claim 18 wherein primers for PCR are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3' 

and 

5'-GTAGACATTCTAGTATCTCGT-3'.
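As an illustrative check that forms no part of the claims, the Python sketch below locates each primer of claim 19 within the upper strand of the corresponding linker of claim 12. The function name `primer_position` is hypothetical, and exact string matching is an assumption standing in for hybridization.

```python
# Upper (5'->3') strands of the linkers recited in claim 12.
LINKER_TOP_STRANDS = {
    "linker 1": "TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG",
    "linker 2": "TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG",
}

# Primers recited in claim 19, paired here with the linker they are assumed to match.
PRIMERS = {
    "linker 1": "CCAGCTTATTCAATTCGGTCC",
    "linker 2": "GTAGACATTCTAGTATCTCGT",
}


def primer_position(linker_top, primer):
    """Hypothetical helper: 0-based offset of the primer in the strand, or -1 if absent."""
    return linker_top.find(primer)


if __name__ == "__main__":
    for name, primer in PRIMERS.items():
        print(name, primer_position(LINKER_TOP_STRANDS[name], primer))
```

Each primer is identical to a stretch of the upper strand of one linker, so it would anneal to the strand complementary to that linker.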

20. A method for detection of gene expression comprising:

cleaving a cDNA sample with a first restriction endonuclease, wherein the endonuclease cleaves the cDNA at a defined position at the 5' or 3' terminus of the cDNA, thereby producing a defined sequence tag;

isolating the defined 5' or 3' cDNA tag;

ligating a first pool of tags with a first oligonucleotide linker having a first sequence useful for hybridization of an amplification primer and ligating a second pool of tags with a second oligonucleotide linker having a second sequence useful for hybridization of an amplification primer;

cleaving the tags with a second restriction endonuclease;

ligating the two pools of tags to produce a ditag; and

determining the nucleotide sequence of the tag(s), wherein the tag(s) correspond to an mRNA from an expressed gene.
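The steps of claim 20 can be mimicked on sequence strings. The Python sketch below is a simplified in silico illustration, not the claimed laboratory method. It assumes that the first endonuclease recognizes CATG (NlaIII, claim 24), that the 3'-most site of each cDNA is the one retained, that the second endonuclease leaves a fixed number of bases past that site (the parameter `TAG_LENGTH`, set to 10 purely for illustration), and that two tags are joined tail to tail. The function names are hypothetical.

```python
ANCHOR_SITE = "CATG"  # recognition site assumed for the first endonuclease (NlaIII)
TAG_LENGTH = 10       # assumed number of bases retained past the anchoring site

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_tag(cdna):
    """Return the tag following the 3'-most anchoring site, or None if unavailable."""
    pos = cdna.rfind(ANCHOR_SITE)
    if pos < 0:
        return None
    start = pos + len(ANCHOR_SITE)
    tag = cdna[start:start + TAG_LENGTH]
    # Discard tags cut short by the end of the cDNA.
    return tag if len(tag) == TAG_LENGTH else None


def make_ditag(tag_a, tag_b):
    """Join two tags tail to tail, so the second tag appears as its reverse complement."""
    return tag_a + reverse_complement(tag_b)


if __name__ == "__main__":
    # Made-up cDNA strings used only to exercise the functions.
    tag_a = extract_tag("GGCATGAAACCCGGGTTTACGTAAAA")
    tag_b = extract_tag("TTCATGACGTACGTTAGGGGAAAA")
    print(tag_a, tag_b, make_ditag(tag_a, tag_b))
```

Joining the tags tail to tail places the anchoring site at both outer ends of the ditag, which is what allows the tag boundaries to be recovered later from a sequenced concatemer.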

21. The method of claim 20 further comprising amplifying the ditag.

22. The method of claim 20, wherein the first restriction endonuclease has at least one recognition site in the cDNA.

23. The method of claim 22 wherein the first restriction enzyme has a four base pair recognition site.

24. The method of claim 23 wherein the restriction endonuclease is NlaIII.

25. The method of claim 20 wherein the cDNA comprises a means for capture.

26. The method of claim 25 wherein the means for capture is a binding element.

27. The method of claim 26 wherein the binding element is biotin.

28. The method of claim 20 wherein the first and second oligonucleotide linkers comprise the same nucleotide sequence.

29. The method of claim 20 wherein the first and second oligonucleotide linkers comprise different nucleotide sequences.

30. The method of claim 29 wherein the first and second oligonucleotide linkers have a sequence

5'- TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG -3'
3'- ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT -5'

5'- TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG -3'
3'- AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT -5',

wherein A is dideoxy A.

31. The method of claim 20, wherein the second restriction endonuclease cleaves at a site distant from the recognition site.

32. The method of claim 31, wherein the second restriction endonuclease is a type IIS endonuclease.

33. The method of claim 32, wherein the type IIS endonuclease is selected from the group consisting of BsmFI and FokI.

34. The method of claim 20, wherein the ditag is about 12 to 60 base pairs.

35. The method of claim 34, wherein the ditag is about 14 to 22 base pairs.

36. The method of claim 20, further comprising ligating the ditags to produce a concatemer.

37. The method of claim 36, wherein the concatemer consists of about 2 to 200 ditags.

38. The method of claim 37, wherein the concatemer consists of about 8 to 20 ditags.
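Once a concatemer of ditags has been sequenced, the individual tags can be recovered and counted. The Python sketch below illustrates one way to do this and is not the procedure of the claims. It assumes that ditags in the concatemer are separated by the anchoring site CATG, that no ditag contains that site internally, and that each tag has a fixed length (`TAG_LENGTH`, set to 10 purely for illustration). The function names are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed separator between ditags in a concatemer
TAG_LENGTH = 10       # assumed fixed tag length

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer):
    """Split a concatemer at the anchoring site and return the two tags of each ditag."""
    tags = []
    for ditag in concatemer.split(ANCHOR_SITE):
        if len(ditag) < 2 * TAG_LENGTH:
            continue  # skip empty fragments and ditags too short to hold two tags
        tags.append(ditag[:TAG_LENGTH])
        # The second tag lies on the opposite strand, read inward from the other end.
        tags.append(reverse_complement(ditag[-TAG_LENGTH:]))
    return tags


def count_tags(concatemers):
    """Tally how often each tag occurs across a collection of concatemer sequences."""
    counts = Counter()
    for concatemer in concatemers:
        counts.update(tags_from_concatemer(concatemer))
    return counts


if __name__ == "__main__":
    # Made-up concatemer of two ditags, used only to exercise the functions.
    example = "CATG" + "AAACCCGGGTTAACGTACGT" + "CATG" + "AAACCCGGGTGGGGGAAAAA" + "CATG"
    for tag, n in count_tags([example]).most_common():
        print(tag, n)
```

The resulting counts give the relative abundance of each tag, which is the quantity a table of tag frequencies would report.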

39. The method of claim 20, wherein the amplifying is by polymerase chain reaction (PCR).

40. The method of claim 39, wherein primers for PCR are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3'

and 







5'-GTAGACATTCTAGTATCTCGT-3'.

41. A kit useful for detection of gene expression, wherein the presence of a cDNA ditag is indicative of expression of a gene having a sequence of a tag of the ditag, the kit comprising one or more containers comprising a first container containing a first oligonucleotide linker having a first sequence useful for hybridization of an amplification primer; a second container containing a second oligonucleotide linker having a second sequence useful for hybridization of an amplification primer, wherein the linkers further comprise a restriction endonuclease site for cleavage of DNA at a site distant from the restriction endonuclease recognition site; and a third and fourth container having nucleic acid primers for hybridization to the first and second unique sequences of the linker.

42. The kit of claim 41 wherein the linkers have a sequence

5'- TTTTACCAGCTTATTCAATTCGGTCCTCTCGCACAGGGACATG -3'
3'- ATGGTCGAATAAGTTAAGCCAGGAGAGCGTGTCCCT -5'

or

5'- TTTTTGTAGACATTCTAGTATCTCGTCAAGTCGGAAGGGACATG -3'
3'- AACATCTGTAAGATCATAGAGCAGTTCAGCCTTCCCT -5',

wherein A is dideoxy A.

43. The kit of claim 41 wherein the restriction endonuclease is a type IIS endonuclease.

44. The kit of claim 43 wherein the type IIS endonuclease is BsmFI.

45. The kit of claim 41 wherein the primers for amplification are selected from the group consisting of

5'-CCAGCTTATTCAATTCGGTCC-3'

and

5'-GTAGACATTCTAGTATCTCGT-3'.

FIGURE 1

FIGURE 2

FIGURE 3